
70 research outputs found

Improving utilization of heterogeneous clusters

Author: Bosque Orero José Luis
Stafford Fernández Esteban
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/11/2020

Datacenters often bring together sets of nodes with different capabilities, leading to sub-optimal resource utilization. One of the best ways of improving utilization is to balance the load by taking into account the heterogeneity of these clusters. This article presents a novel way of expressing computational capacity that is better suited to heterogeneous clusters, and also advocates task migration to further improve utilization. The experimental evaluation shows that both proposals are advantageous, improving the utilization of heterogeneous clusters and reducing the makespan to 16.7% and 17.1%, respectively. This work has been supported by the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and TIN2016-81840-REDT (CAPAP-H6 network) and the European HiPEAC Network of Excellence.

UCrea

Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems

Author: Ayguadé Parra Eduard
Beivide Palacio Ramon
Bosque Jose L.
Martorell Bofill Xavier
Mateo Sergi
Pérez Borja
Stafford Esteban
Teruel Xavier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Heterogeneous systems have very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool for simplifying the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that addresses two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well-known Static and Dynamic algorithms. All this is accomplished with negligible impact on the programming. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system. This work has been supported by the University of Cantabria with grant CVE-2014-18166, the Generalitat de Catalunya under grant 2014-SGR-1051, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P, the Spanish Government through the Programa Severo Ochoa (SEV-2015-0493), the European Research Council under grant agreement No 321253, the European Community's Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc Projects, grant agreements No 288777, 610402 and 671697, and the European HiPEAC Network. Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

Author: Beivide Palacio Ramón
Bosque Orero José Luis
Pérez Pavón Borja
Stafford Fernández Esteban
Publication venue: 'Elsevier BV'
Publication date: 01/11/2021

A challenge that programmers of heterogeneous systems face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of a heterogeneous system. Sigmoid splits the workload in proportion to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided and effortless, as it does not require the user to supply any parameter and adapts to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90% together with energy savings of up to 20%, while always reducing programming effort compared to OpenCL and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
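The idea of splitting a workload in proportion to device capability can be illustrated with a minimal sketch. This is not the Sigmoid algorithm, which the abstract describes as dynamic and adaptive; it is only a static, one-shot proportional split. The function name `proportional_split` and the `speeds` values (relative throughput of each device) are hypothetical and chosen for illustration.

```python
from fractions import Fraction
import math


def proportional_split(total_items, speeds):
    """Split total_items work-items among devices in proportion to speeds.

    speeds holds assumed relative throughputs (hypothetical values, not
    measurements). Leftover items from rounding down go to the devices
    with the largest fractional remainders, so the shares always add up
    to total_items.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")

    # Exact rational arithmetic avoids floating-point rounding surprises.
    total_speed = sum(Fraction(s) for s in speeds)
    exact = [Fraction(total_items) * Fraction(s) / total_speed for s in speeds]

    shares = [math.floor(x) for x in exact]
    leftover = total_items - sum(shares)

    # Hand the remaining items to the largest fractional parts first.
    by_fraction = sorted(
        range(len(speeds)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # A CPU assumed to be three times slower than a GPU.
    print(proportional_split(1000, [1.0, 3.0]))
```

With this split, every device would finish at roughly the same time if the assumed speeds were accurate; a dynamic algorithm is needed precisely because such speeds are usually not known in advance and vary per kernel.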

UCrea

To distribute or not to distribute: The question of load balancing for performance or energy

Author: Beivide Palacio Ramon
Bosque Jose L.
Pérez Borja
Stafford Esteban
Valero Cortés Mateo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Heterogeneous systems are nowadays a common choice on the path to Exascale. Through the use of accelerators they offer outstanding energy efficiency. These devices are programmed with the host-device model, which is suboptimal because the CPU remains idle during kernel execution but still consumes energy. Making the CPU contribute computing effort might improve the performance and energy consumption of the system. This paper analyses the advantages of this approach and sets the limits of when it is beneficial. The claims are supported by a set of models that determine how to share a single data-parallel task between the CPU and the accelerator for optimum performance, energy consumption or efficiency. Interestingly, the models show that optimising performance does not always yield optimum energy or efficiency as well. The paper experimentally validates the models, which represent an invaluable tool for programmers faced with the dilemma of whether to distribute their workload in these systems. This work has been supported by the University of Cantabria (CVE-2014-18166), the Spanish Science and Technology Commission (TIN2016-76635-C2-2-R), the European Research Council (G.A. No 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671697. Peer Reviewed. Postprint (author's final draft).

Crossref

UPCommons. Portal del coneixement obert de la UPC

Un sistema para la docencia a distancia en asignaturas con hardware real [A system for distance teaching in courses with real hardware]

Author: Fuentes Saez Pablo
Mateev Mateev Vladimir Kililov
Stafford Fernández Esteban
Publication date: 01/01/2021

Practical laboratory teaching in hardware-centred courses, such as those in the area of Computer Structure and Organization, has been severely affected by COVID-19. This article introduces a new remote laboratory system for practical sessions based on Raspberry Pi boards running the RISC OS operating system. The system manages both the power supply of the machines and the input/output performed through peripheral devices, and lets students view and interact with the desktop of the remote machine and with the hardware devices connected to it. The system also allows a student and a teacher to view the remote machine simultaneously in real time, which makes it easier to resolve questions and to carry out assessment tests. The system combines control logic based on Arduino modules and Ethernet connections with a web interface written in PHP. With these specifications, a proof of concept with two remote machines and two input interfaces has been successfully developed. The authors thank Fernando Vallejo, Carmen Martínez and Cristóbal Camarero for their collaboration. This work has been partially funded by the V Convocatoria de Proyectos de Innovación Docente of the Vicerrectorado de Ordenación Académica y Profesorado of the Universidad de Cantabria.

UCrea

Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU

Author: Banchelli Fabio
Criado-Ledesma Joel
Garcia-Gasulla Marta
Gracia José
Josep-Fabrego Marc
Mantovani Filippo
Nachtmann Mathias
Stafford Esteban
Publication venue: 'Elsevier BV'
Publication date: 10/07/2020

In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc 3. This system, called Dibona, has been integrated by ATOS/Bull and is powered by Marvell's latest CPU, the ThunderX2. This is the same CPU that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500, in November 2018. Our study ranges from micro-benchmarks to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph 500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has 25% lower performance on average, mainly due to its small vector unit, yet this is somewhat compensated by its 30% wider links between the CPU and the main memory. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that the ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

Author: Ayguadé E.
Beivide Palacio Ramón
Bosque Orero José Luis
Martorell X.
Mateo S.
Pérez Pavón Borja
Stafford Fernández Esteban
Teruel X.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

The emergence of heterogeneous systems has been very notable recently. The nodes of the most powerful computers integrate several compute accelerators, such as GPUs. Profiting from such node configurations is not a trivial endeavour. OmpSs is a framework for task-based parallel applications that allows the execution of OpenCL kernels on different compute devices. However, it does not support the co-execution of a single kernel on several devices. This paper presents an extension of OmpSs that rises to this challenge, and presents Auto-Tune, a load balancing algorithm that automatically adjusts its internal parameters to suit the hardware capabilities and application behavior. The extension allows programmers to take full advantage of the computing devices with negligible impact on the code. It takes care of two main issues: first, the automatic distribution of datasets and the management of device memory address spaces; second, the implementation of a set of load balancing algorithms to adapt to the particularities of applications and systems. Experimental results reveal that the co-execution of single kernels on all the devices in the node is beneficial in terms of performance and energy consumption, and that Auto-Tune gives the best overall results. This work has been supported by the University of Cantabria with grant CVE-2014-18166, the Generalitat de Catalunya under grant 2014-SGR-1051, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P, and the Spanish Government through the Programa Severo Ochoa (SEV-2015-0493).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UCrea

UPCommons. Portal del coneixement obert de la UPC

Assessing the Suitability of King Topologies for Interconnection Networks

Author: Beivide Palacio Ramón
Bosque Orero José Luis
Camarero Coterillo Cristobal
Castillo Villar Emilio
Martínez Fernández María del Carmen
Stafford Fernández Esteban
Vallejo Alonso Fernando
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2016

In recent years many different interconnection networks have been used, following two main tendencies. One is characterized by the use of high-degree routers with long wires, while the other uses routers of much smaller degree. The latter rely on two-dimensional mesh and torus topologies with shorter local links. This paper focuses on doubling the degree of common 2D meshes and tori while still preserving an attractive layout for VLSI design. By adding a set of diagonal links in one direction, diagonal networks are obtained. By adding a second set of links, networks of degree eight are built, named king networks. This research presents a comprehensive study of these networks, which includes a topological analysis, the proposal of appropriate routing procedures and an empirical evaluation. King networks exhibit a number of attractive characteristics that translate into reduced execution times for parallel applications. For example, the execution times of the NPB suite are reduced by up to 30 percent. In addition, this work reveals other properties of king networks, such as perfect partitioning, that deserve further attention for their convenient exploitation in forthcoming high-performance parallel systems.
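The degree-eight construction described above can be sketched in a few lines. The code below assumes a king torus with plain (untwisted) wraparound in both dimensions and at least three nodes per dimension; the paper may use other variants. The function names `king_neighbors` and `king_distance` are hypothetical, chosen for illustration only.

```python
def king_neighbors(x, y, width, height):
    """Return the eight neighbours of node (x, y) in a king torus.

    The four axis-aligned offsets are the links of an ordinary 2D torus;
    the four diagonal offsets are the two extra sets of links that raise
    the degree from four to eight. Assumes width >= 3 and height >= 3 so
    that all eight neighbours are distinct nodes.
    """
    offsets = [
        (dx, dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    ]
    return [((x + dx) % width, (y + dy) % height) for dx, dy in offsets]


def king_distance(a, b, width, height):
    """Minimum hop count between nodes a and b in the same king torus.

    One hop can change both coordinates by one, like a chess king move,
    so the distance is the larger of the two per-dimension ring distances.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    dx = min(dx, width - dx)    # shorter way around the ring in x
    dy = min(dy, height - dy)   # shorter way around the ring in y
    return max(dx, dy)


if __name__ == "__main__":
    print(king_neighbors(0, 0, 4, 4))
    print(king_distance((0, 0), (2, 2), 8, 8))
```

In an ordinary torus the same pair of nodes would be `dx + dy` hops apart, which gives an intuition for why the extra diagonal links can shorten paths.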

UCrea

Temperature Variations from HST Spectroscopy of the Orion Nebula

Author: Baldwin
Bally
Bohlin
Bohlin
Bowers
Burke
Esteban
Esteban
Froese Fischer
G. J. Ferland
Grevesse
Harrington
J. A. Baldwin
J. F. Nguyen
K. P. M. Blagrave
Kingdon
Kwitter
Leitherer
Lennon
Liu
Liu
Liu
Liu
Luo
Martin
O'Dell
O'Dell
Osterbrock
P. G. Martin
Peimbert
Peimbert
Peimbert
Pogge
Péquignot
R. H. Rubin
R. J. Dufour
Rubin
Rubin
Rubin
Rubin
Rubin
Rubin
Rubin
Stafford
Storey
Tsamis
Viegas
Walsh
X.-W. Liu
Publication venue: 'Wiley'
Publication date: 10/12/2002

We present HST/STIS long-slit spectroscopy of NGC 1976. Our goal is to measure the intrinsic line ratio [O III] 4364/5008 and thereby evaluate the electron temperature (T_e) and the fractional mean-square T_e variation (t_A^2) across the nebula. We also measure the intrinsic line ratio [N II] 5756/6585 in order to estimate T_e and t_A^2 in the N^+ region. The interpretation of the [N II] data is not as clear cut as that of the [O III] data because of a higher sensitivity to knowledge of the electron density as well as a possible contribution to the [N II] 5756 emission by recombination (and cascading). We present results from binning the data along the various slits into tiles that are 0.5" square (matching the slit width). The average [O III] temperature for our four HST/STIS slits varies from 7678 K to 8358 K; t_A^2 varies from 0.00682 to at most 0.0176. For our preferred solution, the average [N II] temperature for each of the four slits varies from 9133 K to 10232 K; t_A^2 varies from 0.00584 to 0.0175. The measurements of T_e reported here are an average along each line of sight. Therefore, despite finding remarkably low t_A^2, we cannot rule out significantly larger temperature fluctuations along the line of sight. The result that the average [N II] T_e exceeds the average [O III] T_e confirms what has been previously found for Orion and what is expected on theoretical grounds. Observations of the proplyd P159-350 indicate: large local extinction associated; ionization stratification consistent with external ionization by theta^1 Ori C; and, indirectly, evidence of high electron density. Comment: MNRAS accepted; 30 pages, 3 figures, 2 tables.

arXiv.org e-Print Archive

Crossref

University of Kentucky

CERN Document Server

Energy efficiency of load balancing for data-parallel applications in heterogeneous systems

Author: Borja Pérez
E Castillo
Esteban Stafford
G León
José Luis Bosque
KE Niemeyer
O Beaumont
P Benner
Ramón Beivide
S Hong
Suleyman Tosun
X Cai
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

The use of heterogeneous systems in supercomputing is on the rise as they improve both performance and energy efficiency. However, the programming of these machines requires considerable effort to get the best results in massively data-parallel applications. Maat is a library that enables OpenCL programmers to efficiently execute single data-parallel kernels using all the available devices on a heterogeneous system. It offers a set of load balancing methods, which perform the data partitioning and distribution among the devices, exploiting more of the performance of the system and consequently reducing execution time. Until now, however, a study of the implications of these on the energy consumption has not been made. Therefore, this paper analyses the energy efficiency of the different load balancing methods compared to a baseline system that uses just a single GPU. To evaluate the impact of the heterogeneity of the system, the GPUs were set to different frequencies. The obtained results show that in all the studied cases there is at least one load balancing method that improves energy efficiency.

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UCrea